library(tidyverse)
library(here)
library(maps)
library(plotly)
library(broom)
library(colorspace)
world_df <- map_data("world")Classifier for Under-Developed Countries in the World Using Neural Networks
Abstract
This project could represent the minimal work conducted on underdeveloped countries and subsequently answer the question for the existence of sufficient data regarding underdeveloped countries, with the answer being a loud “NO”. Utilizing variables from The World Bank Data Repository, the project explores socio-economic indicators such as agricultural land, electricity access, fertility rate, internet usage, sanitation, population growth, primary school enrollment, and total unemployment. Despite challenges in data availability, logistic regression and neural network models are employed to predict a country’s development status. Initial attempts with logistic regression reveal limited success due to the complexity of relationships between variables. A neural network model is subsequently developed using a Collab Notebook, demonstrating promising results. The project concludes with insights into the need for more comprehensive and more data as a quantity, and the potential of advanced modeling techniques to address underdevelopment effectively.
Introduction
Being an intern with the Sub-Saharan Africa Poverty Team with the World Bank Group last summer, I realized that the lack of data regarding underdeveloped countries is something that needs to be addressed. This insufficient provision of data does not allow extensive research to be conducted and therefore lead to detailed solutions.
The purpose of this project is to “effectively” construct a classification model for underdeveloped countries, given a set of variables from The World Bank Data Repository. The initial selection of variables was indeed challenging, since most data sets would not contain enough data for underdeveloped countries, with NA values overwhelming the corresponding rows. Because of that, I decided to focus on the 21st century and ignore data before the year 2000. Two primary analytical approaches employed in this project are: logistic regression and neural networks.
Logistic regression, a classical statistical method, is utilized to model the probability of a country being classified as underdeveloped based on its socio-economic indicators. By fitting a logistic regression model to the data, the project seeks to identify the key variables that significantly influence a country’s development status. This approach provides interpretable results and insights into the relative importance of different indicators.
In addition to logistic regression, the project also explores the application of neural networks, a more advanced machine learning technique. Neural networks offer the advantage of capturing complex, nonlinear relationships between predictors and the response variable. By designing and training neural network models, the project aims to uncover nuanced patterns and interactions within the data that may not be captured by traditional linear models like logistic regression.
Load the Materials
electricity_access <- read_csv("data/access_to_electricity/access_to_electricity.csv",
skip = 4) |>
select(-c(3:44), -c(67:69))countries <- c("Afghanistan", "Angola", "Bangladesh", "Benin", "Burkina Faso", "Burundi", "Cambodia", "Central African Republic", "Chad", "Comoros", "Congo, Dem. Rep.", "Djibouti", "Eritrea", "Ethiopia", "Gambia, The", "Guinea", "Guinea-Bissau", "Haiti", "Kiribati", "Lao PDR", "Lesotho", "Liberia", "Madagascar", "Malawi", "Mali", "Mauritania", "Mozambique", "Myanmar", "Nepal", "Niger", "Rwanda", "Sao Tome and Principe", "Senegal", "Sierra Leone", "Solomon Islands", "Somalia", "South Sudan", "Sudan", "Tanzania", "Timor-Leste", "Togo", "Tuvalu", "Uganda", "Yemen, Rep.", "Zambia")electricity_stats <- electricity_access |>
pivot_longer(c(3:24),
names_to = "Year",
values_to = "Electricity Access") |>
filter(!is.na(`Electricity Access`)) |>
group_by(`Country Name`) |>
summarise(`mean_elec_access` = mean(`Electricity Access`, na.rm = TRUE),
`sd_elec_access` = sd(`Electricity Access`, na.rm = TRUE))underdeveloped_electricity <- electricity_access |>
pivot_longer(c(3:24),
names_to = "Year",
values_to = "Electricity Access") |>
filter(!is.na(`Electricity Access`)) underdeveloped_map <- left_join(underdeveloped_electricity, world_df, by = c("Country Name"="region"))plot_1 <- ggplot()+
geom_polygon(data = world_df, mapping = aes(x = long, y = lat, group = group, label = region), fill = "grey")+
geom_polygon(data = underdeveloped_map, mapping = aes(x = long, y = lat, group = group, fill = `Electricity Access`, label = `Country Name`))+
scale_fill_continuous_sequential(palette = "Heat")+
theme_minimal()
ggplotly(plot_1, tooltip = "label")agricultural_land <- read_csv("data/agricultural_land/agricultural_land.csv",
skip = 4) |>
select(-c(3:44), -c(67:69))agriculture_land_stats <- agricultural_land |>
pivot_longer(c(3:24),
names_to = "Year",
values_to = "Agricultural Land") |>
filter(!is.na(`Agricultural Land`)) |>
group_by(`Country Name`) |>
summarise(`mean_agr_land` = mean(`Agricultural Land`, na.rm = TRUE),
`sd_agr_land` = sd(`Agricultural Land`, na.rm = TRUE))population_growth <- read_csv("~/Desktop/Sixth Semester/ds334_final_project/ds334_final_project/data/population_growth_annual/population_growth.csv",
skip = 4) |>
select(-c(3:44), -c(67:69))population_growth_stats <- population_growth |>
pivot_longer(c(3:24),
names_to = "Year",
values_to = "Population Growth Rate") |>
filter(!is.na(`Population Growth Rate`)) |>
group_by(`Country Name`) |>
summarise(`mean_pop_growth` = mean(`Population Growth Rate`, na.rm = TRUE),
`sd_pop_growth` = sd(`Population Growth Rate`, na.rm = TRUE))primary_school_enrol <- read_csv("~/Desktop/Sixth Semester/ds334_final_project/ds334_final_project/data/primary_school_enrollment/primary_school.csv",
skip = 4) |>
select(-c(3:44), -c(67:69))primary_school_enrol_stats <- primary_school_enrol |>
pivot_longer(c(3:24),
names_to = "Year",
values_to = "Primary School Enrollment Rate") |>
filter(!is.na(`Primary School Enrollment Rate`)) |>
group_by(`Country Name`) |>
summarise(`mean_prim_school` = mean(`Primary School Enrollment Rate`, na.rm = TRUE),
`sd_prim_school` = sd(`Primary School Enrollment Rate`,na.rm = TRUE))total_unemployment <- read_csv("~/Desktop/Sixth Semester/ds334_final_project/ds334_final_project/data/total_unemployment/total_unemployment.csv",
skip = 4) |>
select(-c(3:44), -c(67:69))total_unemployment_stats <- total_unemployment |>
pivot_longer(c(3:24),
names_to = "Year",
values_to = "Total Unemployment") |>
filter(!is.na(`Total Unemployment`)) |>
group_by(`Country Name`) |>
summarise(`mean_total_unempl` = mean(`Total Unemployment`, na.rm = TRUE),
`sd_total_unempl` = sd(`Total Unemployment`, na.rm = TRUE))sanitation <- read_csv("data/basic_sanitation_services/basic_sanitation.csv",
skip = 4) |>
select(-c(3:44), -c(67:69))sanitation_stats <- sanitation |>
pivot_longer(c(3:24),
names_to = "Year",
values_to = "Sanitation") |>
filter(!is.na(`Sanitation`)) |>
group_by(`Country Name`) |>
summarise(`mean_sanit` = mean(`Sanitation`, na.rm = TRUE),
`sd_sanit` = sd(`Sanitation`, na.rm = TRUE))fertility_rate <- read_csv("data/fertility_rate/fertility_rate.csv",
skip = 4) |>
select(-c(3:44), -c(67:69))fertility_rate_stats <- fertility_rate |>
pivot_longer(c(3:24),
names_to = "Year",
values_to = "Fertility Rate") |>
filter(!is.na(`Fertility Rate`)) |>
group_by(`Country Name`) |>
summarise(`mean_fert_rate` = mean(`Fertility Rate`, na.rm = TRUE),
`sd_fert_rate` = sd(`Fertility Rate`, na.rm = TRUE))internet <- read_csv("data/internet/internet.csv",
skip = 4) |>
select(-c(3:44), -c(67:69))internet_stats <- internet |>
pivot_longer(c(3:24),
names_to = "Year",
values_to = "Internet") |>
filter(!is.na(`Internet`)) |>
group_by(`Country Name`) |>
summarise(`mean_inter` = mean(`Internet`, na.rm = TRUE),
`sd_inter` = sd(`Internet`, na.rm = TRUE))birth_life_exp <- read_csv("data/life_expectancy_birth/life_expectancy_birth.csv",
skip = 4) |>
select(-c(3:44), -c(67:69))birth_life_exp_stats <- birth_life_exp |>
pivot_longer(c(3:24),
names_to = "Year",
values_to = "Life Expectancy at Birth") |>
filter(!is.na(`Life Expectancy at Birth`)) |>
group_by(`Country Name`) |>
summarise(`mean_life_exp` = mean(`Life Expectancy at Birth`, na.rm = TRUE),
`sd_life_exp` = sd(`Life Expectancy at Birth`, na.rm = TRUE))full_stats_df <- agriculture_land_stats |>
left_join(birth_life_exp_stats, by = "Country Name") |>
left_join(electricity_stats, by = "Country Name") |>
left_join(fertility_rate_stats, by = "Country Name") |>
left_join(internet_stats, by = "Country Name") |>
left_join(sanitation_stats, by = "Country Name") |>
left_join(population_growth_stats, by = "Country Name") |>
left_join(primary_school_enrol_stats, by = "Country Name") |>
left_join(total_unemployment_stats, by = "Country Name") full_stats_df <-
full_stats_df |>
mutate(Underdeveloped = ifelse(`Country Name` %in% countries, 1, 0))Variable Description
| Variable | Description |
|---|---|
| agriculture_land | The share of land area that is arable, under permanent crops, and under permanent pastures |
| birth_life_exp | The number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life |
| electricity_access | The percentage of population with access to electricity |
| fertility_rate | The number of children that would be born to a woman if she were to live to the end of her childbearing years and bear children in accordance with age-specific fertility rates of the specified year |
| internet | The percentage of population that uses the internet |
| population_growth | Annual population growth rate for year t is the exponential rate of growth of midyear population from year t-1 to t, expressed as a percentage |
| primary_school_enrol | Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown |
| sanitation | Percentage of people using at least basic sanitation services, that is, improved sanitation facilities that are not shared with other households |
| total_unemployment | The share of the labor force that is without work but available for and seeking employment |
Data Set Investigation
library(GGally)
ggpairs(data = full_stats_df, columns = c(2, 4, 6, 8, 10, 12, 14, 16, 18))Logistic Regression Model Attempt
# Get the means column for each variable
full_means_df <- full_stats_df |>
select(contains("mean"))
numeric_columns <- sapply(full_means_df, is.numeric)
scaled_data <- scale(full_means_df[, numeric_columns])
scaled_data_with_undev <- scaled_data |>
as.data.frame() |>
mutate(Underdeveloped = full_stats_df$Underdeveloped) |>
na.omit()median_agr_land = median(scaled_data_with_undev$mean_agr_land, na.rm = TRUE)
median_life_exp = median(scaled_data_with_undev$mean_life_exp, na.rm = TRUE)
median_elec_access = median(scaled_data_with_undev$mean_elec_access, na.rm = TRUE)
median_fert_rate = median(scaled_data_with_undev$mean_fert_rate, na.rm = TRUE)
median_sanit = median(scaled_data_with_undev$mean_sanit, na.rm = TRUE)
median_population_growth = median(scaled_data_with_undev$mean_pop_growth, na.rm = TRUE)
median_primary_school = median(scaled_data_with_undev$mean_prim_school, na.rm = TRUE)
median_total_unempl = median(scaled_data_with_undev$mean_total_unempl, na.rm = TRUE)
median_internet = median(scaled_data_with_undev$mean_inter, na.rm = TRUE)library(modelr)
## First Attempt
# Fit the logistic regression model with all the variables
model_glm <- glm(Underdeveloped ~ .,
data = scaled_data_with_undev, family = "binomial")
# Check the summary of the model
summary(model_glm)
Call:
glm(formula = Underdeveloped ~ ., family = "binomial", data = scaled_data_with_undev)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.1249 1.2098 -4.236 2.27e-05 ***
mean_agr_land 0.6680 0.4263 1.567 0.11707
mean_life_exp 2.3677 0.8656 2.735 0.00624 **
mean_elec_access -1.9810 0.8652 -2.290 0.02204 *
mean_fert_rate 2.6992 0.9042 2.985 0.00283 **
mean_inter -2.9287 1.4195 -2.063 0.03910 *
mean_sanit -0.8134 0.8092 -1.005 0.31476
mean_pop_growth -2.4035 1.0252 -2.344 0.01906 *
mean_prim_school 0.2884 0.2613 1.104 0.26964
mean_total_unempl 0.4911 0.3008 1.633 0.10252
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 217.431 on 226 degrees of freedom
Residual deviance: 76.759 on 217 degrees of freedom
AIC: 96.759
Number of Fisher Scoring iterations: 8
# Identify numeric columns
numeric_cols <- sapply(scaled_data_with_undev, is.numeric)
# Compute range for each numeric column, removing NA values
ranges <- sapply(scaled_data_with_undev[, numeric_cols], function(x) {
x <- na.omit(x)
c(min(x), max(x))
})underdeveloped <- scaled_data_with_undev |>
filter(Underdeveloped == 1)developed <- scaled_data_with_undev |>
filter(Underdeveloped != 1) ## Second Attempt
# Fit the logistic regression model with selected variables using their range and the median for the rest
# Used the first model to come up with the most "significant" ones
model_glm <- glm(Underdeveloped ~ .,
data = scaled_data_with_undev, family = "binomial")
# Create the grid for prediction
grid <- full_stats_df |>
data_grid(
mean_agr_land = median_agr_land,
mean_life_exp = seq_range(c(-2.627905, 1.588857), n = 20),
mean_elec_access = seq_range(c(-2.6785616, 0.7295024), n = 5),
mean_fert_rate = seq_range(c(-1.274220, 3.082338), n = 20),
mean_sanit = median_sanit,
mean_inter = median_internet,
mean_pop_growth = median_population_growth,
mean_prim_school = median_primary_school,
mean_total_unempl = median_total_unempl
)
# Predict probabilities that a country is Underdeveloped
aug_model <- augment(model_glm, newdata = grid, se_fit = TRUE) |>
mutate(.predprob = plogis(.fitted))
# Visualize the predictions
ggplot(data = aug_model, aes(x = mean_fert_rate, y = .predprob)) +
geom_line(aes(color = as.factor(round(mean_life_exp, 2)))) +
facet_wrap(~as.factor(round(mean_elec_access, 2))) +
labs(x = "Standardized Mean Fertility Rate", y = "Predicted Probability", color = "Standardized Mean Electricity Access") +
theme_minimal()## Third Attempt
# Fit the logistic regression model with selected variables using their range and the median for the rest
# Used the first model to include the ones that were significant but not as much
grid <- full_stats_df |>
data_grid(
mean_agr_land = median_agr_land,
mean_life_exp = seq_range(c(-2.627905, 1.588857), n = 10),
mean_elec_access = median_elec_access,
mean_fert_rate = seq_range(c(-1.274220, 3.082338), n = 10),
mean_sanit = median_sanit,
mean_inter = median_internet,
mean_pop_growth =seq_range(c(-2.022998, 4.438928), n = 3),
mean_prim_school = median_primary_school,
mean_total_unempl = median_total_unempl
)
# Predict probabilities for the grid
aug_model <- augment(model_glm, newdata = grid, se_fit = TRUE) |>
mutate(.predprob = plogis(.fitted))
# Visualize the predictions
ggplot(data = aug_model, aes(x = mean_fert_rate, y = .predprob)) +
geom_line(aes(color = as.factor(round(mean_life_exp, 2)))) +
facet_wrap(~mean_pop_growth) +
labs(x = "Standardized Mean Fertility Rate", y = "Predicted Probability", color = "Standardized Mean Life Expectancy") +
theme_minimal()## Forth Attempt
# Fit the logistic regression model with selected variables using their range and the median for the rest
# Used the first model to include the ones that were significant but not as much
grid <- full_stats_df |>
data_grid(
mean_agr_land = median_agr_land,
mean_life_exp = median_life_exp,
mean_elec_access = median_elec_access,
mean_fert_rate = seq_range(c(-1.308798, 2.850897), n = 10),
mean_sanit = median_sanit,
mean_inter = seq_range(c(-1.511381, 2.405766), n = 10),
mean_pop_growth =seq_range(c(-2.221712, 3.779502), n = 3),
mean_prim_school = median_primary_school,
mean_total_unempl = median_total_unempl
)
# Predict probabilities for the grid
aug_model <- augment(model_glm, newdata = grid, se_fit = TRUE) |>
mutate(.predprob = plogis(.fitted))
# Visualize the predictions
ggplot(data = aug_model, aes(x = mean_fert_rate, y = .predprob)) +
geom_line(aes(color = as.factor(round(mean_inter, 2)))) +
facet_wrap(~mean_pop_growth) +
labs(x = "Standardized Mean Fertility Rate", y = "Predicted Probability", color = "Standardized Mean Population Growth") +
theme_minimal()In general, what we have seen from the above plots is that attempting to fit a Logistic Regression Model given different combinations of proxies does not do a good job. Therefore, a more complex model might be able to interpret the relationships between the various variables that we have.
Used Collab Notebook to Construct and Fit a Neural Network
Take home message: Not enough data, how does the model do with developed countries
total_countries <- full_stats_df |>
select(`Country Name`)full_means_df <- full_means_df |>
cbind(total_countries)total_countries_no_na <- full_means_df |>
na.omit() |>
select(`Country Name`)scaled_data_with_undev_arranged <- scaled_data_with_undev |>
cbind(total_countries_no_na) |>
arrange(mean_agr_land)|>
select(mean_agr_land, `Country Name`)predictions <- read.csv("data/predicted_dataset.csv")predictions <- predictions |>
arrange(mean_agr_land) |>
cbind(`Country Name` = scaled_data_with_undev_arranged$`Country Name`)First Map with the Original Data Set
underdeveloped_map <- world_df |>
mutate(Underdeveloped = as.factor(ifelse(region %in% countries, 1, 0)))plot_1 <- ggplot()+
geom_polygon(data = world_df, mapping = aes(x = long, y = lat, group = group, label = region), fill = "grey")+
geom_polygon(data = underdeveloped_map, mapping = aes(x = long, y = lat, group = group, fill = Underdeveloped, label = region))+
scale_fill_manual(values = c("0" = "grey", "1" = "darkblue")) +
theme_minimal()+
theme(
legend.position = "none"
)
ggplotly(plot_1, tooltip = "label")Second Map with the Predicted Data Set
predictions <- predictions |>
mutate(Underdeveloped = ifelse(Predicted_output >= 0.5, 1, 0))predicted_full_df <- left_join(predictions, world_df, by = c("Country Name" = "region"))plot_2 <- ggplot()+
geom_polygon(data = world_df, mapping = aes(x = long, y = lat, group = group, label = region), fill = "grey")+
geom_polygon(data = predicted_full_df, mapping = aes(x = long, y = lat, group = group, fill = as.factor(Underdeveloped),
label = `Country Name`))+
scale_fill_manual(values = c("0" = "grey", "1" = "darkred")) +
theme_minimal()+
theme(
legend.position = "none"
)
ggplotly(plot_2, tooltip = "label")Table with the Number of Developed and Underdeveloped, comparing the actual numbers to the ones predicted from the neural network
summary_actual <- scaled_data_with_undev |>
summarise(Underdeveloped = sum(Underdeveloped),
Developed = n() - Underdeveloped)summary_predicted <- predictions |>
summarise(Underdeveloped = sum(Underdeveloped),
Developed = n() - Underdeveloped)library(pander)
# Convert summaries to data frames
summary_actual_df <- as.data.frame(summary_actual)
summary_predicted_df <- as.data.frame(summary_predicted)
# Add row names for clarity
row.names(summary_actual_df) <- "Actual"
row.names(summary_predicted_df) <- "Predicted"
combined_summary <- rbind(summary_actual_df, summary_predicted_df)
# Print combined summary using pander
pander(combined_summary, caption = "Summary of Actual and Predicted Classes")| Underdeveloped | Developed | |
|---|---|---|
| Actual | 42 | 185 |
| Predicted | 51 | 176 |